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data. The ALDC and ELDC cores provide the flexibility required to handle 
the many different data compression situations presented in today's 
computer systems. 

This paper describes the important factors associated with compressing 
data with the ALDC algorithm. It examines complications introduced when 
data is structured and discusses enhancements to compression by sharing 
compression contexts between segments. Finally, the new features of the 
ELDC core that address variable-length segments, error recovery, and 
minimization of expansion are presented. 

Factors affecting performance 

The first ALDC products (ALDC-5S, -20S, and -40S) [2] compress data as 
one continuous stream of data. This data is compressed by analyzing 
successive bytes until all bytes have been processed. As each byte is 
processed, it is compared against the most recent bytes in the history. 
As bytes are compared, the ALDC engine keeps track of consecutive 
matching bytes. When the longest matching sequence of bytes is determined, 
a code word is inserted into the compressed data stream that points to the 
location and the length of the matching sequence. If no matches are found, 
the data is coded as a literal and also inserted into the compressed data 
stream. Compression is realized when byte sequences are replaced by smaller 
code words. The details of the ALDC algorithm are presented in a companion 
paper [3] by Craft. 
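
The matching process described above can be illustrated with a toy encoder. 
This is only a sketch of the general sliding-window idea, not the ALDC bit 
format or its hardware search, which are covered in [3]. The token layout, 
the MIN_MATCH threshold, and the allowance for overlapping copies are all 
assumptions made for illustration. 

```python
from typing import List, Tuple, Union

Literal = Tuple[str, int]        # ("lit", byte_value)
Copy = Tuple[str, int, int]      # ("copy", distance_back, length)
Token = Union[Literal, Copy]

MIN_MATCH = 2  # assumed threshold: shorter matches are emitted as literals


def encode(data: bytes, history_size: int = 512) -> List[Token]:
    """Greedy longest-match encoding against the most recent history bytes."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        start = max(0, pos - history_size)
        for cand in range(start, pos):
            length = 0
            # Count consecutive matching bytes. In this toy, a match may run
            # past `pos` into the bytes being encoded (overlapping copy).
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= MIN_MATCH:
            tokens.append(("copy", best_dist, best_len))
            pos += best_len
        else:
            tokens.append(("lit", data[pos]))
            pos += 1
    return tokens


def decode(tokens: List[Token]) -> bytes:
    """Rebuild the original bytes from literal and copy tokens."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte by byte so overlaps work
    return bytes(out)
```

For the input b"abcabcabcx", this sketch emits three literals, one copy 
token covering six bytes, and a final literal. 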

When data is compressed as a continuous stream, no indication of the 
former data structure is maintained in the compressed form. That is to say, 
any boundary in the original data will be indistinguishable in the 
compressed data. To retrieve any bytes within the original data, it will be 
necessary to decompress the entire preceding data structure. 

Figure 1 shows a sample compression. The end marker (EM) denotes the end 
of the compressed data stream. 

The achievable compression performance depends most significantly upon 
the content of the data being compressed. If many long, matching byte 
sequences are encountered, compression performance will be maximized. 
However, if entirely random data dominates, creating only literal code 
words, the compressed data stream will expand 12.5%. Unfortunately, in 
most cases, the application cannot control the nature of the data it is 
working with. 
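
The 12.5% figure can be checked with a short calculation. The sketch below 
assumes that each literal costs nine bits (one flag bit plus the eight data 
bits); that cost is an assumption consistent with the stated expansion, not 
a detail taken from this section. 

```python
RAW_BITS = 8       # bits per uncompressed byte
LITERAL_BITS = 9   # assumption: 1 flag bit + 8 data bits per literal


def all_literal_expansion() -> float:
    """Fractional growth of the stream when every byte becomes a literal."""
    return LITERAL_BITS / RAW_BITS - 1.0


def all_literal_ratio() -> float:
    """Compression ratio (original size : compressed size) in that case."""
    return RAW_BITS / LITERAL_BITS


print(all_literal_expansion())        # 0.125
print(round(all_literal_ratio(), 2))  # 0.89
```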

One factor the application can control is the size of the history. The 
size of the history affects compression: The larger the history, the more 
sequences that are available to be referenced. Although the average 
code-word size does increase slightly as the history depth increases, 
effective compression can increase. In the ALDC cores, the history buffer 
can be configured in sizes of 512, 1024, and 2048 bytes. Figure 2 shows the 
impact of the history size on compression for a group of files known as the 
Calgary Corpus [4], arranged by increasing original file size. The Calgary 
Corpus represents a range of typical file types that would appear in a 
computer system. In most cases, the larger history enables a higher 
compression ratio. The gain depends on the type of data used; diabolical 
data types are minimally affected by a larger history buffer size. 
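
The trade-off between history depth and code-word size can be sketched as 
follows. The assumption here, which is not stated in this section, is that 
the displacement field of a copy code word needs log2(history size) bits. 
The helper names are hypothetical. 

```python
def displacement_bits(history_size: int) -> int:
    """Bits needed to address any position in a history of the given size."""
    return (history_size - 1).bit_length()


def reachable(distance_back: int, history_size: int) -> bool:
    """True if a repeat `distance_back` bytes earlier is still in the history."""
    return 1 <= distance_back <= history_size


for size in (512, 1024, 2048):
    # A repeat 1500 bytes back can only be referenced by the largest history.
    print(size, displacement_bits(size), reachable(1500, size))
```

Under this assumption, each doubling of the history costs one extra bit per 
copy code word, while making more distant sequences available as matches. 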

The final factor that affects compression performance is the size of 
the data to be compressed. The graph in Figure 3 shows a plot of 
compression ratios for the Calgary Corpus collection of files. The content 
of the data modulates compression performance much more than the file size. 
For large volumes of data, the ALDC algorithm is relatively independent of 
the number of bytes processed. Size becomes significant when the byte 
counts approach the depth of the history. This effect is discussed further 
in the next section. 



Segmented compression 



In complex systems requiring higher levels of organization, data is 
separated into smaller, more manageable segments. Segmented data can be 
found everywhere, from communication systems in CSU/DSUs to networking 
protocols (Ethernet, Token Ring, and ATM) and even personal computer file 
systems (hard disks, CD-ROMs, and tape drives). Partitioning the data 
permits structure to be preserved during compression and compressed data to 
be multiplexed after compression. Segmented compression divides the raw 
data prior to compression. Each system presents a different set of 
requirements that affect data organization and data compression. 

Figure 4 demonstrates segmented compression. The raw data is partitioned 
into either fixed-length segments or variable-length segments. 

Intersegment dependencies 



As seen in the discussion of segmented compression, the impact of 
segmentation is significant. Inherent in that discussion is the 
presumption that all segments can be decompressed independently of the 
others. In some systems, segment ordering is required. A given segment may 

Segmented compression can allow part of the compression context to be 
retained from one compression operation to another. Since it is assumed 
that a relationship exists among segments, retaining access to the 
history buffer between operations significantly improves the compression 
ratio of each segment. 

For example, consider Figure 6, which shows original data partitioned 
into four segments. Assume that the history buffer is reset between 
original segments C2 and C3. In order to decompress segment C1, it is 
necessary to decompress both segments C0 and C1, in order. Likewise, it is 
necessary to decompress segments C0, C1, and C2, in order, to access data 
within the third segment. The fourth segment is independent of all previous 
segments, so it may be decompressed by itself. As with compression without 
segments, it is necessary to decompress all of the previous data from the 
last history reset to retrieve a given byte within that segment. 
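
The dependency rule in this example can be written as a small helper. The 
function name and the convention of listing the segments that begin with a 
fresh history are hypothetical, chosen only to mirror the four-segment 
example. 

```python
from typing import List, Sequence


def segments_to_decompress(target: int, reset_before: Sequence[int]) -> List[int]:
    """Indices that must be decompressed, in order, to read segment `target`.

    `reset_before` lists the segments that start with a freshly reset
    history. Segment 0 always starts with an empty history.
    """
    resets = set(reset_before) | {0}
    start = max(r for r in resets if r <= target)
    return list(range(start, target + 1))


# Four segments, with the history reset between the third and fourth.
print(segments_to_decompress(1, reset_before=[3]))  # [0, 1]
print(segments_to_decompress(2, reset_before=[3]))  # [0, 1, 2]
print(segments_to_decompress(3, reset_before=[3]))  # [3]
```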

In Figure 7, paper1 is again compressed with various segment sizes. The 
new data represents paper1 compressed while retaining the compression 
context between segments. The effect of the end-of-segment marker and 
broken-byte-sequence matches can be seen at very small segment sizes. The 
end-of-segment marker adds thirteen bits to the compressed data stream, 
causing minor expansion. Broken byte sequences result in multiple copy code 
words of smaller copy lengths or literals replacing copy code words of 
larger matches. As segment sizes increase, the compression ratio approaches 
the continuous-stream compression ratio much earlier than independent 
segments do. The result is almost constant compression independent of 
segment size. Allowing the compression context to be retained between 
segments improves ALDC's ability to compress segmented data. 
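
The benefit of a retained context can be reproduced with any 
history-based compressor. The sketch below uses zlib (DEFLATE) from the 
Python standard library purely as a stand-in; it is not ALDC, its history 
is much larger, and its flush marker differs from the end-of-segment marker 
discussed here. The test data is deliberately artificial: no repetition 
inside a segment, full repetition across segments. 

```python
import zlib


def independent(segments):
    """Compress each segment with a fresh context (history reset every time)."""
    return [zlib.compress(seg) for seg in segments]


def shared(segments):
    """Compress all segments with one context, flushing so that each
    compressed piece ends on a byte boundary and can be stored separately."""
    comp = zlib.compressobj()
    return [comp.compress(seg) + comp.flush(zlib.Z_SYNC_FLUSH)
            for seg in segments]


def decompress_shared(pieces):
    """Pieces must be fed in order, since later ones refer to earlier history."""
    dec = zlib.decompressobj()
    return [dec.decompress(piece) for piece in pieces]


# 16 segments of 256 bytes each.
segments = [bytes(range(256)) for _ in range(16)]
size_independent = sum(len(p) for p in independent(segments))
size_shared = sum(len(p) for p in shared(segments))
print(size_independent, size_shared)
```

With independent segments, nothing in this data can be matched; with the 
shared context, every segment after the first is replaced by a reference 
into the history left behind by its predecessor. 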

Extending segmented compression 

As applications using data compression evolve, more intricate data 
structures are required. In late 1997, the Linear Tape-Open (LTO) [5] 
alliance, consisting of the Hewlett-Packard, Seagate, and IBM companies, 
proposed a standard to unite a fragmented tape-drive industry. 
The Linear Tape-Open Data Compression (LTO-DC) [footl] specification defines a 
format with variable-length segments and provides for methods to minimize 
expansion and recover from system errors. This advanced form of 
partitioning is supported by the new embedded lossless data compression 
(ELDC) core. 



In this format, raw data can be partitioned into relevant segments, such 
as blocks, clusters, files, etc. Each raw segment can be compressed to form 
a record, which may be collected into a formatted block. Formatted block 
size is programmable and can be configured to comply with the LTO 
specifications. Additionally, the LTO specification defines the generation 
of decompression access points. Access points within each compressed 
formatted block correspond to a location at which the history was reset. 
Extraction of a record from the compressed data is accomplished by 
decompressing from the preceding access point. During compression, the ELDC 
core tracks record and formatted block boundaries and automatically resets 
the history to create an access point. The location of the access point 
within the compressed data stream is provided through status registers. 

The ELDC core also reduces the overhead for error recovery by dividing 
records into one or more subsegments, known as bursts, as shown in Figure 8. 
A compressed burst is the smallest identifiable block of compressed 
data generated by the ELDC core, and is padded to a four-byte boundary. 
Each burst is terminated by an end-of-burst (EOB) control code. The final 
burst is terminated by either an EOB or the end-of-record (EOR) control 
code. This feature provides the ability to correlate compressed data blocks 
to data bus transfers, independently of the overall record size. If a very 
large record is divided into smaller bursts, a bus parity error within a 
single bus transfer does not necessitate the recovery of the entire record. 
Retransmission and compression can occur on the failed burst boundary. 
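
The burst bookkeeping can be sketched in a few lines. Only the four-byte 
padding comes from the text; the helper names, the fixed raw burst size, and 
the idea of restarting at a raw byte offset are assumptions made for 
illustration, and the EOB and EOR control codes are not modeled. 

```python
from typing import List

WORD = 4  # compressed bursts are padded to a four-byte boundary


def padded_length(nbytes: int) -> int:
    """Round a compressed burst length up to the next multiple of four bytes."""
    return nbytes + (-nbytes) % WORD


def split_into_bursts(record: bytes, burst_size: int) -> List[bytes]:
    """Divide a raw record into bursts of at most `burst_size` bytes."""
    return [record[i:i + burst_size] for i in range(0, len(record), burst_size)]


def restart_offset(failed_burst: int, burst_size: int) -> int:
    """Raw byte offset at which retransmission resumes after a failed burst."""
    return failed_burst * burst_size


bursts = split_into_bursts(bytes(10), burst_size=4)
print([len(b) for b in bursts])               # [4, 4, 2]
print(padded_length(5))                       # 8
print(restart_offset(2, burst_size=4))        # 8
```

In this sketch, an error in the third burst of a ten-byte record restarts 
the transfer at offset 8 rather than at the beginning of the record. 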

Minimization of expansion is also addressed by the ELDC core. The 
compressed data stream can take one of two modes: the first mode is composed 
of compressed data, and the second mode is composed of raw data. The ELDC 
core can change between these modes. The accompanying figure 
shows the effects of the mode-changing abilities of the ELDC core. Without 
mode changing, the compression ratio 
decreases to 0.89:1 (12.5% data expansion) as the randomness of the raw 
data approaches 100%. However, when the ELDC scheme-swapping algorithm is 
applied to the same data, the compression ratio stabilizes around 1:1 as 
the randomness of the raw data increases. 
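
The idea of swapping between a compressed mode and a raw mode can be 
sketched per block. This is not the ELDC scheme-swapping algorithm or the 
LTO-DC encoding: zlib stands in for the compressor, the one-byte mode tags 
are hypothetical, and the decision is made once per block for simplicity. 

```python
import hashlib
import zlib

MODE_COMPRESSED = 1  # hypothetical mode tags, not LTO-DC control codes
MODE_RAW = 2


def encode_block(block: bytes) -> bytes:
    """Emit the block in whichever mode is smaller, prefixed by a mode tag."""
    packed = zlib.compress(block)  # stand-in compressor, not ALDC
    if len(packed) < len(block):
        return bytes([MODE_COMPRESSED]) + packed
    return bytes([MODE_RAW]) + block


def decode_block(encoded: bytes) -> bytes:
    """Undo encode_block by inspecting the mode tag."""
    mode, payload = encoded[0], encoded[1:]
    return zlib.decompress(payload) if mode == MODE_COMPRESSED else payload


def pseudo_random(n: int) -> bytes:
    """Deterministic, incompressible-looking bytes from a SHA-256 chain."""
    out, state = bytearray(), b"seed"
    while len(out) < n:
        state = hashlib.sha256(state).digest()
        out += state
    return bytes(out[:n])


noisy = pseudo_random(4096)
text = b"abc" * 1000
print(len(encode_block(noisy)), len(encode_block(text)))
```

In this sketch the worst case is bounded by the size of the mode tag, so 
random input grows by one byte per block instead of by a fixed percentage. 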

Summary 

With the increasing need to integrate as much function as possible into 
VLSI applications, the ALDC and ELDC cores provide flexible, lossless data 
compression solutions. The system integrator must consider the size of the 
history and how the data is managed to optimize compression performance. 
By providing automatic data segmentation and history control, the ALDC core 
provides a compression/decompression architecture suited to many system 
environments. The ELDC core extends this architecture to work with 
variable-length segments with burst controls, error recovery, and expansion 
minimization. The ALDC and ELDC cores continue to evolve to meet the 
needs of today's complex systems. 
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Effect of increasing history size. 
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Compression of a record. 
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